Urban Heat Island Predictor

Spring 2025 Data Science Project
Team: Nithin Nambi, Aditya Koul, Arnav K, Archer Sariscak

Contributions

Nithin Nambi – I wrote the header, came up with the topic idea, wrote the introduction, completed checkpoint 1, wrote the building footprint analysis in data exploration and the graphs along with that.

Aditya Koul – I performed the exploratory data analysis and visualization of the UHI data. I also picked which of the three datasets to use for learning and what ML model to use. I created the model and wrote a few conclusions and results after completing training. Finally, I converted the notebook to HTML and uploaded it to my GitHub Pages site for hosting.

Arnav K – I performed the data curation of NY Mesonet weather data. I documented the data sources and transformation processes for all three datasets. For weather analysis, I compared temperature patterns between Bronx and Manhattan stations, creating statistical tests and visualizations.

Archer Sariscak – I created the visualization of the model accuracy. Additionally, I wrote the section explaining how someone else could use the model with example data. Finally, I wrote the conclusion and insights that we found through our model and analysis.

Introduction

Anyone who’s spent a summer night in Manhattan knows the city can feel like a giant oven. Thanks to the Urban Heat Island (UHI) effect, urban areas end up much hotter than the surrounding countryside. If we don’t do something about UHIs, heat waves get nastier, power grids take a large hit from the extra cooling load, and people with health issues face real danger.

For this project, we’re building a model to predict ground temperatures across NYC by combining four key data sources:

Ground readings from July 24, 2021

Sentinel-2 satellite imagery

Building-footprint geometry

High-resolution weather observations

Our main questions are:

How close can we get to the actual ground temperatures using these inputs?

Which factors really drive the UHI effect at the neighborhood level?

These answers matter because city planners, public-health teams, and energy managers need accurate forecasts to target cooling interventions, optimize building designs, and brace for extreme-heat events. Our work also plugs into an industry challenge on AI for urban heat mapping, so we’re both solving a real-world problem and pushing research on climate-resilient cities forward.

Data Curation/Preprocessing

Our project examines the Urban Heat Island (UHI) effect in New York City using three complementary datasets from the EY Open Science AI and Data Challenge (https://challenge.ey.com/challenges/the-2025-ey-open-science-ai-and-data-challenge-cooling-urban-heat-islands-external-participants/data-description).

The Building Footprint Data, provided by NYC's Office of Technology and Innovation, contains polygonal outlines of 9,436 buildings in our study area. This spatial data helps us understand how urban density impacts local temperatures. Structures absorb heat differently than vegetation, creating heat pockets that contribute to the UHI effect we're studying.
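To make this density idea concrete, a buffer-based coverage metric can be sketched as below. This is a minimal illustration with toy polygons; the coordinates, radius, and the `footprint_density` helper are arbitrary assumptions for demonstration, not values or functions used elsewhere in this notebook.

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Toy footprints in a metric CRS (EPSG:3395), standing in for the real KML data
footprints = gpd.GeoDataFrame(
    geometry=[box(0, 0, 30, 30), box(100, 0, 120, 40), box(500, 500, 510, 510)],
    crs="EPSG:3395",
)

def footprint_density(gdf, x, y, radius=250):
    """Fraction of a circular neighborhood covered by building footprints."""
    neighborhood = Point(x, y).buffer(radius)
    built = gdf.geometry.intersection(neighborhood).area.sum()
    return built / neighborhood.area

print(f"{footprint_density(footprints, 50, 20):.3f}")
```

Denser neighborhoods yield values closer to 1, and such a metric could serve as a feature linking the footprint geometry to local heat retention.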

Our weather data comes from the New York State Mesonet network, with measurements from stations in both the Bronx and Manhattan. Collected at 5-minute intervals throughout July 24, 2021 (from 6:00 AM to 8:00 PM), it includes essential meteorological variables like air temperature, humidity, wind conditions, and solar radiation. Having data from two locations with different urban characteristics allows us to directly observe temperature differentials.

The UHI Index Training Data, developed by CAPA Strategies, provides our ground measurements of heat intensity. This dataset contains 11,229 georeferenced temperature readings collected between 3-4 PM (15:01-15:59) on July 24, 2021, covering upper Manhattan and parts of the Bronx. These measurements serve as our target variable for modeling neighborhood-level heat patterns.

In the code below, we transform these datasets into analysis-ready formats:

For building data, we convert coordinates to metric units (EPSG:3395) and calculate areas to quantify urban density.

The weather data undergoes timestamp standardization (removing EDT indicators and converting to datetime objects) and hourly aggregation to reveal temporal patterns in temperature differences between the Bronx and Manhattan.

For the UHI data, we convert string timestamps to proper datetime objects and analyze the distribution of temperature values across the city, with values ranging from 0.956 to 1.046 (mean: 1.000001).

These preprocessing steps allow us to investigate how urban structure and weather conditions contribute to temperature variations in different neighborhoods.

In [34]:
# Import statements
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# File paths
building_footprint_kml = '../training_data/Building_Footprint.kml'
NY_weather_xlsx = '../training_data/NY_Mesonet_Weather.xlsx'
UHI_data_csv = '../training_data/Training_data_uhi_index_2025-02-18.csv'

Exploratory Data Analysis

Building Footprint Data Analysis

Data Exploration and Summary Statistics

In [35]:
import geopandas as gpd

gdf = gpd.read_file(building_footprint_kml, driver='KML')
df = pd.DataFrame(gdf)
print(df.head())
# In the first few rows, we can see that the Name and Description columns are all the same. 
# The geometry column contains geometrical data representing the building footprints, in the form of MULTIPOLYGON.
  Name Description                                           geometry
0                   MULTIPOLYGON (((-73.91903 40.8482, -73.91933 4...
1                   MULTIPOLYGON (((-73.92195 40.84963, -73.92191 ...
2                   MULTIPOLYGON (((-73.9205 40.85011, -73.92045 4...
3                   MULTIPOLYGON (((-73.92056 40.8514, -73.92053 4...
4                   MULTIPOLYGON (((-73.91234 40.85218, -73.91247 ...
In [36]:
print(df.dtypes)
# The Name and Description columns are objects (strings), while geometry is a geometry type, 
# which is expected as this column is geospatial data (polygonal geometries of the buildings).
Name             object
Description      object
geometry       geometry
dtype: object
In [37]:
print(df.describe(include=[object]))
# With describe we are able to see that the Name and Description columns each have only one unique value. 
# Since every entry is the same, we can ignore these columns for most types of analysis.
        Name Description
count   9436        9436
unique     1           1
top                     
freq    9436        9436
In [38]:
print(df.info())
# The dataset contains 9436 entries and 3 columns. All columns are non-null, 
# meaning there are no missing values in any of them. This is good because it ensures 
# that our analysis will not be affected by missing data.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9436 entries, 0 to 9435
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   Name         9436 non-null   object  
 1   Description  9436 non-null   object  
 2   geometry     9436 non-null   geometry
dtypes: geometry(1), object(2)
memory usage: 221.3+ KB
None
In [39]:
missing_data = df.isna().sum()
print("Missing data:", missing_data)
# There is no missing data in the dataset, as expected, since all columns show 0 missing values.
# This indicates that we have a clean dataset in terms of completeness.
Missing data: Name           0
Description    0
geometry       0
dtype: int64
In [40]:
gdf = gdf.to_crs(epsg=3395) #EPSG:3395 uses meters as units
gdf['area'] = gdf.geometry.area
print(gdf[['Name', 'area']].head())
# The areas of the buildings vary significantly, with values like 1080.60 m², 166.11 m², and so on.
# This suggests that the dataset includes buildings of different sizes.
  Name         area
0       1080.601783
1        166.114638
2        246.325998
3        138.914032
4        376.844794
In [41]:
gdf['area'].max()
# The building with the largest area is 124277.6 m²
Out[41]:
124277.6496267651

Conclusion: From the initial exploration, we found that the Name and Description columns contain no variation, so they don't contribute to further analysis. We also confirmed that there is no missing data. Next, we focus on the geometrical data (building footprints), which likely holds the most relevant information for further analysis.

Hypothesis Testing

In [42]:
mean_area = gdf['area'].mean()
print(f"Mean area of the buildings: {mean_area} square meters")
# The mean area of the buildings is approximately 3479.11 m².
# This provides a reference value for our hypothesis testing.
Mean area of the buildings: 3479.1095318903926 square meters
In [43]:
from scipy import stats

t_area = 3500
t_stat, p_value = stats.ttest_1samp(gdf['area'], t_area)
print(f"T-statistic: {t_stat}, P-value: {p_value}")
T-statistic: -0.36877837220982873, P-value: 0.7123012012019707
In [44]:
if p_value < 0.05: # Written as a conditional because we experimented with different values of t_area.
    print(f"We reject the null hypothesis: The average area is significantly different from {t_area} square meters.")
else:
    print(f"We fail to reject the null hypothesis: The average area is not significantly different from {t_area} square meters.")

# The t-statistic is -0.37, and the p-value is 0.71.
# The t-statistic indicates that the sample mean is slightly less than the threshold.
# However, the p-value is much greater than 0.05, which means we fail to reject the null hypothesis.
We fail to reject the null hypothesis: The average area is not significantly different from 3500 square meters.

Plots

In [45]:
plt.figure(figsize=(10, 6))
gdf['area'].hist(bins=100, edgecolor='black')
plt.title('Distribution of Building Footprint Areas')
plt.xlabel('Area (meters squared)')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()

# The large peak on the left indicates that most buildings in the study area are very small.
# The long tail to the right shows that some buildings are much larger, but they are relatively few.
# This suggests the data has a few large outliers, while the majority of buildings have small footprints.
In [46]:
plt.figure(figsize=(8, 6))
gdf['area'].plot(kind='box', vert=False, color='lightcoral')
plt.title('Boxplot of Building Footprint Areas')
plt.xlabel('Area (meters squared)')
plt.show()

# The boxplot provides a visualization of the spread and outliers in the building area data. 
# The central 50% of building areas lie between the lower and upper quartiles,
# with many outliers visible on the upper end of the plot.
# These outliers represent buildings with areas far above the typical range,
# and could be very large buildings or errors in the data.
# The boxplot highlights the presence of extreme values, which could influence the overall statistics.
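The outlier claim above can be quantified with the standard 1.5 × IQR fence that boxplots use. This is a sketch on synthetic right-skewed areas standing in for the real gdf['area'] series; the lognormal parameters are arbitrary assumptions chosen only to mimic a heavy right tail.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed areas standing in for gdf['area']
areas = pd.Series(rng.lognormal(mean=6, sigma=1.2, size=9436))

# Boxplot outlier rule: points above Q3 + 1.5 * IQR
q1, q3 = areas.quantile([0.25, 0.75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr
outliers = areas[areas > upper_fence]
print(f"{len(outliers)} of {len(areas)} areas exceed the upper fence ({upper_fence:.1f} m²)")
```

Running the same computation on the real series would give a concrete count of the extreme buildings the boxplot only shows visually.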

NY Weather Data Analysis

In [47]:
# First, we will load in the NY Mesonet Weather data from the Excel file
# and do some basic data exploration to understand what we're working with
import seaborn as sns
from scipy import stats
import matplotlib.dates as mdates
from datetime import datetime

# Set the style for plots - using seaborn for nicer visualizations
plt.style.use('seaborn-v0_8')
sns.set(font_scale=1.2)
plt.rcParams['figure.figsize'] = (12, 7)

# Read the Excel file containing data for both Bronx and Manhattan
NY_weather_xlsx = '../training_data/NY_Mesonet_Weather.xlsx'
bronx_df = pd.read_excel(NY_weather_xlsx, sheet_name="Bronx")
manhattan_df = pd.read_excel(NY_weather_xlsx, sheet_name="Manhattan")

# Let's look at the basic info to understand what we're working with
print("Dataset Information:")
print(f"Bronx data: {bronx_df.shape[0]} rows, {bronx_df.shape[1]} columns")
print(f"Manhattan data: {manhattan_df.shape[0]} rows, {manhattan_df.shape[1]} columns")

print("\nBronx data - first 5 rows:")
print(bronx_df.head())
Dataset Information:
Bronx data: 169 rows, 6 columns
Manhattan data: 169 rows, 6 columns

Bronx data - first 5 rows:
               Date / Time  Air Temp at Surface [degC]  \
0  2021-07-24 06:00:00 EDT                        19.3   
1  2021-07-24 06:05:00 EDT                        19.4   
2  2021-07-24 06:10:00 EDT                        19.3   
3  2021-07-24 06:15:00 EDT                        19.4   
4  2021-07-24 06:20:00 EDT                        19.4   

   Relative Humidity [percent]  Avg Wind Speed [m/s]  \
0                         88.2                   0.8   
1                         87.9                   0.8   
2                         87.6                   0.7   
3                         87.4                   0.5   
4                         87.0                   0.2   

   Wind Direction [degrees]  Solar Flux [W/m^2]  
0                       335                  12  
1                       329                  18  
2                       321                  25  
3                       307                  33  
4                       301                  42  
In [48]:
# We rename the columns to make them easier to work 
# with throughout the analysis
column_mapping = {
    "Date / Time": "datetime",
    "Air Temp at Surface [degC]": "temperature",
    "Relative Humidity [percent]": "humidity",
    "Avg Wind Speed [m/s]": "wind_speed",
    "Wind Direction [degrees]": "wind_direction",
    "Solar Flux [W/m^2]": "solar_flux"
}

bronx_df = bronx_df.rename(columns=column_mapping)
manhattan_df = manhattan_df.rename(columns=column_mapping)

# Looking at the datetime column, we need to convert it to actual datetime objects
# This will make time-based analysis much easier
bronx_df['datetime'] = pd.to_datetime(bronx_df['datetime'].str.replace(' EDT', ''))
manhattan_df['datetime'] = pd.to_datetime(manhattan_df['datetime'].str.replace(' EDT', ''))

# We extract just the hour information for hourly analysis
bronx_df['hour'] = bronx_df['datetime'].dt.hour
manhattan_df['hour'] = manhattan_df['datetime'].dt.hour

# Add location identifier to each dataframe so we know which is which
bronx_df['location'] = 'Bronx'
manhattan_df['location'] = 'Manhattan'
In [49]:
# Now we check for any missing values or other data quality issues,
# and look at the timespan of our data
# Create a combined dataset for easier comparison later
combined_df = pd.concat([bronx_df, manhattan_df], ignore_index=True)

# Check for missing values
print("\nMissing Values Check:")
print(f"Bronx: {bronx_df.isnull().sum().sum()} missing values")
print(f"Manhattan: {manhattan_df.isnull().sum().sum()} missing values")

# Get time range of the data
earliest = bronx_df['datetime'].min()
latest = bronx_df['datetime'].max()
print("\nTime Range:")
print(f"Earliest datetime: {earliest}")
print(f"Latest datetime: {latest}")

# We have no missing values, and the data covers about a 14-hour period
# on July 24, 2021, from around 6 AM to 8 PM. 
Missing Values Check:
Bronx: 0 missing values
Manhattan: 0 missing values

Time Range:
Earliest datetime: 2021-07-24 06:00:00
Latest datetime: 2021-07-24 20:00:00
In [50]:
# Now let's analyze the Urban Heat Island effect by comparing temperatures
# between Manhattan and the Bronx throughout the day

# First, we group by hour and calculate the average temperature for each location
bronx_hourly = bronx_df.groupby('hour')['temperature'].mean()
manhattan_hourly = manhattan_df.groupby('hour')['temperature'].mean()

# Let's look at the hourly temperature values to understand the differences
print("\nHourly temperature values:")
for hour in range(6, 21):
    if hour in bronx_hourly.index and hour in manhattan_hourly.index:
        diff = manhattan_hourly[hour] - bronx_hourly[hour]
        print(f"Hour {hour}: Bronx {bronx_hourly[hour]:.2f}°C, Manhattan {manhattan_hourly[hour]:.2f}°C, " +
            f"Diff: {diff:.2f}°C")

# We can clearly see that Manhattan is warmer in the morning hours, and then
# the Bronx becomes warmer in the afternoon.
Hourly temperature values:
Hour 6: Bronx 19.39°C, Manhattan 21.69°C, Diff: 2.30°C
Hour 7: Bronx 20.09°C, Manhattan 22.50°C, Diff: 2.41°C
Hour 8: Bronx 21.67°C, Manhattan 23.44°C, Diff: 1.77°C
Hour 9: Bronx 23.61°C, Manhattan 24.36°C, Diff: 0.75°C
Hour 10: Bronx 24.85°C, Manhattan 24.88°C, Diff: 0.03°C
Hour 11: Bronx 25.86°C, Manhattan 25.48°C, Diff: -0.38°C
Hour 12: Bronx 26.46°C, Manhattan 26.34°C, Diff: -0.12°C
Hour 13: Bronx 26.94°C, Manhattan 27.23°C, Diff: 0.29°C
Hour 14: Bronx 27.48°C, Manhattan 27.31°C, Diff: -0.18°C
Hour 15: Bronx 27.52°C, Manhattan 26.73°C, Diff: -0.78°C
Hour 16: Bronx 26.86°C, Manhattan 26.84°C, Diff: -0.02°C
Hour 17: Bronx 26.09°C, Manhattan 25.72°C, Diff: -0.37°C
Hour 18: Bronx 25.22°C, Manhattan 25.18°C, Diff: -0.04°C
Hour 19: Bronx 25.01°C, Manhattan 25.10°C, Diff: 0.09°C
Hour 20: Bronx 24.90°C, Manhattan 24.60°C, Diff: -0.30°C
In [51]:
# Let's calculate some key metrics to quantify the heat island effect

# Diurnal temperature range (DTR) is the difference between daily max and min temps
bronx_dtr = bronx_hourly.max() - bronx_hourly.min()
manhattan_dtr = manhattan_hourly.max() - manhattan_hourly.min()

# For the analysis, we want to look at different periods throughout the day
# Morning (6-9 AM), Midday (10 AM - 1 PM), Afternoon (2-5 PM), Evening (6-8 PM)

# Define hour ranges for each period
morning_hours = range(6, 10)
midday_hours = range(10, 14)
afternoon_hours = range(14, 18)
evening_hours = range(18, 21)

# Filter data for each period
bronx_morning = bronx_df[bronx_df['hour'].isin(morning_hours)]
manhattan_morning = manhattan_df[manhattan_df['hour'].isin(morning_hours)]

bronx_midday = bronx_df[bronx_df['hour'].isin(midday_hours)]
manhattan_midday = manhattan_df[manhattan_df['hour'].isin(midday_hours)]

bronx_afternoon = bronx_df[bronx_df['hour'].isin(afternoon_hours)]
manhattan_afternoon = manhattan_df[manhattan_df['hour'].isin(afternoon_hours)]

bronx_evening = bronx_df[bronx_df['hour'].isin(evening_hours)]
manhattan_evening = manhattan_df[manhattan_df['hour'].isin(evening_hours)]

# Calculate mean temperatures for each period
morning_diff = manhattan_morning['temperature'].mean() - bronx_morning['temperature'].mean()
midday_diff = manhattan_midday['temperature'].mean() - bronx_midday['temperature'].mean()
afternoon_diff = manhattan_afternoon['temperature'].mean() - bronx_afternoon['temperature'].mean()
evening_diff = manhattan_evening['temperature'].mean() - bronx_evening['temperature'].mean()

# Now we verify our calculations to make sure they're correct
print("\nTemperature differences by time period (Manhattan - Bronx):")
print(f"Morning (6-9 AM): {morning_diff:.2f}°C")
print(f"Midday (10 AM - 1 PM): {midday_diff:.2f}°C")
print(f"Afternoon (2-5 PM): {afternoon_diff:.2f}°C")
print(f"Evening (6-8 PM): {evening_diff:.2f}°C")
Temperature differences by time period (Manhattan - Bronx):
Morning (6-9 AM): 1.81°C
Midday (10 AM - 1 PM): -0.04°C
Afternoon (2-5 PM): -0.34°C
Evening (6-8 PM): 0.01°C
In [52]:
# We run a statistical test to see if the morning temperature difference
# is statistically significant. We'll use a two-sample t-test.

t_stat_morning, p_value_morning = stats.ttest_ind(
    bronx_morning['temperature'],
    manhattan_morning['temperature'],
    equal_var=False  # Using Welch's t-test since variances might differ
)

print("\n=== Urban Heat Island Effect Analysis ===")
print(f"Bronx diurnal temperature range: {bronx_dtr:.2f}°C")
print(f"Manhattan diurnal temperature range: {manhattan_dtr:.2f}°C")
print("\nStatistical test for morning temperature difference:")
print(f"t-statistic: {t_stat_morning:.4f}")
print(f"p-value: {p_value_morning:.10f}")

if p_value_morning < 0.05:
    print("Result: Statistically significant difference in morning temperatures (p < 0.05)")
    print(f"The morning temperature difference of +{morning_diff:.2f}°C is statistically significant!")
else:
    print("Result: No statistically significant difference in morning temperatures")

# The p-value is extremely small (much less than 0.05), which means
# the temperature difference we observed is very unlikely to be due to chance.
# This is strong evidence of the urban heat island effect.
=== Urban Heat Island Effect Analysis ===
Bronx diurnal temperature range: 8.12°C
Manhattan diurnal temperature range: 5.62°C

Statistical test for morning temperature difference:
t-statistic: -6.2925
p-value: 0.0000000163
Result: Statistically significant difference in morning temperatures (p < 0.05)
The morning temperature difference of +1.81°C is statistically significant!
In [53]:
# Now we'll create a visualization to show the urban heat island effect
# We want to make a figure that shows both the hourly temperatures and 
# the differences between locations

plt.figure(figsize=(14, 9))

# Main plot: Hourly temperatures
ax1 = plt.subplot2grid((3, 3), (0, 0), colspan=3, rowspan=2)
ax1.plot(bronx_hourly.index, bronx_hourly.values, 'o-', color='#1f77b4', linewidth=3, label='Bronx', markersize=8)
ax1.plot(manhattan_hourly.index, manhattan_hourly.values, 'o-', color='#ff7f0e', linewidth=3, label='Manhattan', markersize=8)

# Shade the morning period to highlight UHI effect
ax1.axvspan(6, 9, alpha=0.15, color='green', label='Morning Hours')

# Add annotations for key times
bronx_max_hour = bronx_hourly.idxmax()
manhattan_max_hour = manhattan_hourly.idxmax()

ax1.annotate(f'Peak: {bronx_hourly.max():.1f}°C',
            xy=(bronx_max_hour, bronx_hourly.max()),
            xytext=(bronx_max_hour+0.5, bronx_hourly.max()+0.5),
            arrowprops=dict(arrowstyle='->', color='#1f77b4'),
            color='#1f77b4')

ax1.annotate(f'Peak: {manhattan_hourly.max():.1f}°C',
            xy=(manhattan_max_hour, manhattan_hourly.max()),
            xytext=(manhattan_max_hour+0.5, manhattan_hourly.max()+0.5),
            arrowprops=dict(arrowstyle='->', color='#ff7f0e'),
            color='#ff7f0e')

# Annotate morning difference
ax1.annotate(f'Morning Difference: +{morning_diff:.2f}°C\n(p < 0.0001)',
            xy=(7.5, (bronx_hourly[7] + manhattan_hourly[7])/2),
            xytext=(7.5, (bronx_hourly[7] + manhattan_hourly[7])/2 - 1.5),
            ha='center',
            bbox=dict(boxstyle="round,pad=0.5", fc="white", ec="gray", alpha=0.8),
            arrowprops=dict(arrowstyle='->', color='black'))

# Add labels to main graph
ax1.set_title('Urban Heat Island Effect: Hourly Temperature Comparison', fontsize=18, fontweight='bold', pad=20)
ax1.set_xlabel('Hour of Day', fontsize=14)
ax1.set_ylabel('Average Temperature (°C)', fontsize=14)
ax1.set_xticks(range(6, 21))
ax1.grid(True, linestyle='--', alpha=0.7)
ax1.legend(fontsize=12)

# Subplot: Temperature difference
ax2 = plt.subplot2grid((3, 3), (2, 0), colspan=2)
temp_diff = manhattan_hourly - bronx_hourly
bars = ax2.bar(temp_diff.index, temp_diff.values, color=['green' if val > 0 else 'blue' for val in temp_diff])

# Color the bars by time period
for i, bar in enumerate(bars):
    hour = temp_diff.index[i]
    if 6 <= hour <= 9:  # Morning
        bar.set_color('#8cc751')  # Light green
    elif 10 <= hour <= 13:  # Midday
        bar.set_color('#f5b642')  # Light orange
    elif 14 <= hour <= 17:  # Afternoon
        bar.set_color('#f58d42')  # Darker orange
    else:  # Evening
        bar.set_color('#4286f5')  # Blue

ax2.axhline(y=0, color='black', linestyle='-', alpha=0.3)
ax2.set_xlabel('Hour of Day', fontsize=12)
ax2.set_ylabel('Temperature Difference\n(Manhattan - Bronx, °C)', fontsize=12)
ax2.set_xticks(range(6, 21))
ax2.grid(True, axis='y', linestyle='--', alpha=0.7)

# Subplot: Key statistics and findings
ax3 = plt.subplot2grid((3, 3), (2, 2))
ax3.axis('off')  # Turn off axis

# Add text box with key statistics
stats_text = (
    "Urban Heat Island Effect:\n"
    "-------------------------\n"
    f"Morning diff: +{morning_diff:.2f}°C\n"
    f"Midday diff: {midday_diff:.2f}°C\n"
    f"Afternoon diff: {afternoon_diff:.2f}°C\n"
    f"Evening diff: {evening_diff:.2f}°C\n\n"
    f"DTR Bronx: {bronx_dtr:.2f}°C\n"
    f"DTR Manhattan: {manhattan_dtr:.2f}°C\n\n"
    f"t-test: {t_stat_morning:.2f}\n"
    f"p-value: <0.0001"
)

ax3.text(0, 1, stats_text, fontsize=10, va='top',
         bbox=dict(boxstyle="round,pad=0.5", fc="#f0f0f0", ec="gray"))

plt.tight_layout()
plt.show()

Conclusion

Our testing indicates significant microclimatic differences between the Bronx and Manhattan. Manhattan exhibits classic urban heat island characteristics, with notably warmer morning temperatures (p < 0.0001) and reduced daily temperature fluctuations compared to the Bronx. This temperature difference is most pronounced during morning hours but diminishes throughout the day, even slightly reversing by afternoon. The data suggests that Manhattan's dense urban landscape retains more overnight heat, while the Bronx may experience greater daytime warming due to less building shade. These patterns align with established urban climatology research on how building density, construction materials, and vegetation coverage influence local temperature variations.

UHI Data Analysis

In [54]:
UHI_df = pd.read_csv(UHI_data_csv)
print(UHI_df.head())
   Longitude   Latitude          datetime  UHI Index
0 -73.909167  40.813107  24-07-2021 15:53   1.030289
1 -73.909187  40.813045  24-07-2021 15:53   1.030289
2 -73.909215  40.812978  24-07-2021 15:53   1.023798
3 -73.909242  40.812908  24-07-2021 15:53   1.023798
4 -73.909257  40.812845  24-07-2021 15:53   1.021634
In [55]:
UHI_df.dtypes
Out[55]:
Longitude    float64
Latitude     float64
datetime      object
UHI Index    float64
dtype: object
In [56]:
# We see that this CSV file has four features: Longitude, Latitude, datetime, and
# UHI Index. However, the datetime column does not contain datetime objects, so
# we must first convert the column to hold datetime objects.

UHI_df['datetime'] = pd.to_datetime(UHI_df['datetime'], format="%d-%m-%Y %H:%M")
UHI_df.dtypes
print(UHI_df.head())
   Longitude   Latitude            datetime  UHI Index
0 -73.909167  40.813107 2021-07-24 15:53:00   1.030289
1 -73.909187  40.813045 2021-07-24 15:53:00   1.030289
2 -73.909215  40.812978 2021-07-24 15:53:00   1.023798
3 -73.909242  40.812908 2021-07-24 15:53:00   1.023798
4 -73.909257  40.812845 2021-07-24 15:53:00   1.021634
In [57]:
# Looking through the data, I see that the dates from all of the data are similar.
# To make sure, I print the earliest and latest dates for the data below

earliest = UHI_df['datetime'].min()
latest = UHI_df['datetime'].max()

print("Earliest datetime:", earliest)
print("Latest datetime:", latest)
Earliest datetime: 2021-07-24 15:01:00
Latest datetime: 2021-07-24 15:59:00
In [58]:
# We see that all of the data was gathered within a single hour, so this
# column provides no meaningful information. Therefore we drop it, keeping
# in mind that the UHI values for these latitudes and longitudes were
# recorded on July 24th, a very warm time of year in New York

UHI_df = UHI_df.drop(columns=['datetime'])
print(UHI_df.head())
   Longitude   Latitude  UHI Index
0 -73.909167  40.813107   1.030289
1 -73.909187  40.813045   1.030289
2 -73.909215  40.812978   1.023798
3 -73.909242  40.812908   1.023798
4 -73.909257  40.812845   1.021634
In [59]:
print(UHI_df.count())
Longitude    11229
Latitude     11229
UHI Index    11229
dtype: int64
In [60]:
# With the code above, we see that the df has 11229 entries with each column also
# having 11229 entries. This is good because this means there are no missing values
# in the data. We will use the describe function to get a better understanding
# of the data below

summary = UHI_df.describe()
print(summary)
          Longitude      Latitude     UHI Index
count  11229.000000  11229.000000  11229.000000
mean     -73.933927     40.808800      1.000001
std        0.028253      0.023171      0.016238
min      -73.994457     40.758792      0.956122
25%      -73.955703     40.790905      0.988577
50%      -73.932968     40.810688      1.000237
75%      -73.909647     40.824515      1.011176
max      -73.879458     40.859497      1.046036

The descriptive statistics show some interesting trends. The Latitude and Longitude values have a very narrow range, which makes sense because the data covers only upper Manhattan and parts of the Bronx. What we found surprising is that the UHI Index also has a relatively small range. It is centered at almost exactly 1 (mean 1.000001), and since the median is close to the mean, the UHI data is fairly symmetrically distributed.
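The symmetry observation can also be checked numerically with a skewness statistic. This is a sketch on placeholder values drawn to mimic the summary statistics above (mean 1.0, std ≈ 0.016), not the real UHI readings.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(42)
# Placeholder UHI-like values mimicking the summary above (mean 1.0, std ~0.016)
uhi = rng.normal(loc=1.0, scale=0.016, size=11229)

# Skewness near 0 indicates a symmetric distribution;
# positive values indicate a right tail, negative a left tail
print(f"skewness: {skew(uhi):.3f}")
```

Applying `skew` to the real `UHI_df['UHI Index']` column would confirm or refine the mean-vs-median comparison made above.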

In [61]:
# We can get a better visual understanding of the data by plotting it. Below,
# we plot the Latitude and Longitude values to see where the data was collected.
# For context, we plot the points on top of a map of Manhattan

import folium
from IPython.display import HTML

# Create a map centered around Manhattan
manhattan_map = folium.Map(location=[40.81, -73.9402778], zoom_start=12)  # Center found with trial and error

# Add points to the map
for _, row in UHI_df.iterrows():
    folium.CircleMarker(
        location=[row['Latitude'], row['Longitude']],
        radius=5,
        color='blue',
        fill=True,
        fill_color='blue',
        fill_opacity=0.5
    ).add_to(manhattan_map)

iframe = manhattan_map._repr_html_()
display(HTML('<div style="width:800px; height:500px">{}</div>'.format(iframe)))

manhattan_map.save('maps/uhi_locations.html')

View Manhattan UHI Locations Map if Above is not Loading

The map shows that the data was collected only in upper Manhattan, starting from West 57th Street (around the bottom of Central Park) and extending up into the Bronx. This is interesting, as none of Midtown or lower Manhattan was captured.

In [62]:
# We only plotted the Latitude and Longitude values above. There is no UHI metric
# shown, so the heatmap below shows the UHI indices for each Latitude and Longitude

from folium.plugins import HeatMap
from branca.colormap import LinearColormap


manhattan_map = folium.Map(location=[40.81, -73.9402778], zoom_start=12)

# Create a custom color map for the small range of UHI values
min_uhi = 0.956122
max_uhi = 1.046036

colors = ['blue', 'lightblue', 'yellow', 'orange', 'red']
colormap = LinearColormap(colors, vmin=min_uhi, vmax=max_uhi)
colormap.caption = 'UHI Index'

# Add the color map to the map
colormap.add_to(manhattan_map)

# Add points to the map with colors based on UHI Index
for _, row in UHI_df.iterrows():
    folium.CircleMarker(
        location=[row['Latitude'], row['Longitude']],
        radius=5,
        color=colormap(row['UHI Index']),  # Use the same color for border as fill
        weight=0,  # Set weight to 0 to remove the border
        fill=True,
        fill_color=colormap(row['UHI Index']),
        fill_opacity=0.7,
        popup=f"UHI Index: {row['UHI Index']:.6f}"
    ).add_to(manhattan_map)

iframe = manhattan_map._repr_html_()
display(HTML('<div style="width:800px; height:500px">{}</div>'.format(iframe)))

manhattan_map.save('maps/uhi_heatmap.html')

# Note: There is text below this section but sometimes when viewing in GitHub, 
# it does not show for some reason. To see the next section, download this notebook 
# and view locally.

View Manhattan UHI Heat Map if Above is not Loading

The map shows that areas in the Bronx and upper Manhattan have higher relative UHI indices compared to the areas around Central Park. This makes sense, as proximity to trees and other forms of urban nature (like parks, green roofs, and vegetation) helps reduce the Urban Heat Island effect in cities through several key mechanisms:

  • 🌳 Shade and Reduced Surface Temperatures
  • 💧 Evapotranspiration
  • 🌿 Reduced Heat Storage
  • 🌬️ Improved Air Circulation
  • 🏙️ Mitigation of Anthropogenic Heat

Primary Analysis¶

Dataset Selection¶

After reviewing the three candidate datasets explored above, the NY Mesonet Weather dataset (Bronx + Manhattan stations) provides the richest set of independent variables and the strongest theoretical link to the Urban Heat Island (UHI) phenomenon.

  • Relevance of features – Each record contains meteorological drivers of UHI intensity (surface air temperature, relative humidity, wind speed/direction, solar flux).
  • Temporal coverage – Continuous 5‑minute readings across a full summer day allow us to model diurnal variation instead of a single snapshot.
  • Spatial pairing – Measurements from two contrasting urban contexts (dense Manhattan vs. greener Bronx) collected at the same timestamps let us derive a direct target variable:

$ \text{UHI\_difference}(t)=T_{\text{Manhattan}}(t)\;-\;T_{\text{Bronx}}(t) $
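The target computation above is just an element-wise difference of the two stations' aligned temperature series. A minimal sketch with made-up readings (the values here are illustrative, not taken from the Mesonet file):

```python
import pandas as pd

# Hypothetical surface temperatures (degC) at three shared timestamps
ts = pd.to_datetime(["2021-07-24 12:00", "2021-07-24 12:05", "2021-07-24 12:10"])
manhattan = pd.Series([31.2, 31.5, 31.9], index=ts)
bronx     = pd.Series([29.8, 30.1, 30.6], index=ts)

# Element-wise difference gives the UHI target at each timestamp
uhi_diff = manhattan - bronx
print(uhi_diff.round(2).tolist())  # [1.4, 1.4, 1.3]
```

Because both series share the same timestamp index, pandas aligns them automatically; any timestamp present in only one station would yield NaN, which is why the real pipeline uses an inner join.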

The alternative datasets are less informative:

  • Building Footprints – geometry only; no target variable or meteorological context.
  • Interpolated UHI index CSV – target available, but predictors limited to latitude/longitude (spatial only), offering little explanatory power.

Machine‑Learning Technique¶

We model $\text{UHI\_difference}$ with a Random Forest Regressor (ensemble of decision trees).

  • Handles non‑linear interactions (e.g., solar flux × wind speed) without manual feature engineering.
  • Naturally models feature importance, highlighting dominant physical drivers.
  • Performs well on small‑to‑medium tabular datasets and is less sensitive to hyper‑parameters than gradient‑boosting methods given our 169‑row sample.
  • Robust to multicollinearity and invariant to monotonic transformations of the input features.

Below we construct the feature set, train the model, evaluate performance, and inspect feature importances.

In [63]:
# Primary Analysis: Random-Forest on Weather-derived UHI difference
import pandas as pd
import re
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, r2_score
import matplotlib.pyplot as plt
import seaborn as sns

# 1. Load Mesonet sheets
weather_path = "../training_data/NY_Mesonet_Weather.xlsx"
bronx_raw     = pd.read_excel(weather_path, sheet_name="Bronx")
man_raw       = pd.read_excel(weather_path, sheet_name="Manhattan")

# 2. Clean & prefix-rename columns
def snakeify(txt: str) -> str:
    """Turn 'Air Temp at Surface [degC]' -> 'Air_Temp_at_Surface'."""
    txt_no_units = txt.split("[")[0].strip()        # drop bracket & units
    return re.sub(r"[^A-Za-z0-9]+", "_", txt_no_units)  # spaces & punctuation

def clean_cols(df: pd.DataFrame, prefix: str) -> pd.DataFrame:
    df = df.copy()

    # robust timestamp parse
    ts = (
        df["Date / Time"]
        .astype(str)
        .str.replace(r"\s+[A-Z]{2,4}$", "", regex=True)   # drop ' EDT' etc.
    )
    df["Date / Time"] = pd.to_datetime(ts, errors="coerce")
    df = df.dropna(subset=["Date / Time"]).reset_index(drop=True)

    # --- rename every non-timestamp column ---
    rename_map = {
        col: f"{prefix}_{snakeify(col)}"
        for col in df.columns if col != "Date / Time"
    }
    return df.rename(columns=rename_map)

bronx = clean_cols(bronx_raw,   "Bronx")
man   = clean_cols(man_raw,     "Man")

# 3. Merge on timestamp
merged = pd.merge(bronx, man, on="Date / Time", how="inner")

# 4. Identify the surface-temperature columns automatically
def find_temp_col(columns, city_prefix):
    """Return the column that ends with '_Air_Temp_at_Surface' (case-insensitive)."""
    for col in columns:
        if re.fullmatch(fr"{city_prefix}_Air_Temp_at_Surface", col, flags=re.I):
            return col
    raise KeyError(f"No surface-air-temperature column found for {city_prefix}")

bronx_temp = find_temp_col(merged.columns, "Bronx")
man_temp   = find_temp_col(merged.columns, "Man")

# 5. Target and feature set
merged["UHI_diff"] = merged[man_temp] - merged[bronx_temp]

feature_cols = [
    c for c in merged.columns
    if c not in ["Date / Time", "UHI_diff", bronx_temp, man_temp]
]

X = merged[feature_cols]
y = merged["UHI_diff"]

# 6. Train / test split & model
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)

rf = RandomForestRegressor(
    n_estimators=300,
    max_depth=None,
    random_state=42,
    oob_score=True,
    n_jobs=-1
)
rf.fit(X_train, y_train)

# 7. Evaluation
y_pred = rf.predict(X_test)
print("\nModel performance:")
print(f"Out-of-bag R²     : {rf.oob_score_:.3f}")
print(f"Test set MAE      : {mean_absolute_error(y_test, y_pred):.3f} °C")
print(f"Test set R²       : {r2_score(y_test, y_pred):.3f}")

# 8. Feature-importance plot
imp = (
    pd.Series(rf.feature_importances_, index=feature_cols)
      .sort_values(ascending=False)
)
plt.figure(figsize=(8, 6))
sns.barplot(x=imp.values, y=imp.index)
plt.title("Random-Forest feature importance")
plt.xlabel("Mean decrease in impurity")
plt.tight_layout()
plt.show()
Model performance:
Out-of-bag R²     : 0.870
Test set MAE      : 0.191 °C
Test set R²       : 0.958
[Figure: bar chart of Random-Forest feature importance (mean decrease in impurity)]

Pipeline steps¶

  1. Load data

    • Read the Bronx and Manhattan sheets from NY_Mesonet_Weather.xlsx.
  2. Clean & rename columns

    • Strip units (e.g., [degC]), convert to snake_case, and prefix each header with the station name (Bronx_, Man_).
    • Parse the "Date / Time" field; drop any rows whose timestamps cannot be parsed.
  3. Merge stations

    • Inner-join the two data sets on the cleaned timestamp.
  4. Create target

    • UHI_diff = Man_Air_Temp_at_Surface − Bronx_Air_Temp_at_Surface.
  5. Build feature matrix

    • Remove the timestamp, both raw surface-temperature columns, and the new target column.
  6. Train / test split

    • 80 % training, 20 % testing (random_state = 42).
  7. Fit model

    • RandomForestRegressor with 300 trees, unlimited depth, OOB scoring on, all cores (n_jobs = -1).
  8. Evaluate

    • Report Out-of-bag R², Test MAE (°C), and Test R².
  9. Interpret

    • Plot mean-decrease-in-impurity feature importance to see which weather variables drive the Manhattan-Bronx temperature gap.
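Step 2's header cleaning can be sanity-checked in isolation. This sketch reproduces the snakeify helper from the cell above and shows its effect on two raw Mesonet-style headers:

```python
import re

def snakeify(txt: str) -> str:
    """Drop bracketed units, then collapse runs of non-alphanumerics to '_'."""
    txt_no_units = txt.split("[")[0].strip()        # 'Air Temp at Surface [degC]' -> 'Air Temp at Surface'
    return re.sub(r"[^A-Za-z0-9]+", "_", txt_no_units)

print(snakeify("Air Temp at Surface [degC]"))  # Air_Temp_at_Surface
print(snakeify("Avg Wind Speed [m/s]"))        # Avg_Wind_Speed
```

The pipeline then prepends the station prefix, e.g. f"Man_{snakeify(col)}" yields Man_Air_Temp_at_Surface.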

What the model tells us¶

  • Captures ~96 % of the variance in the 5-minute UHI differences (Test R² ≈ 0.96) with an average absolute error of ~0.19 °C.
  • Highlights the meteorological features most strongly associated with urban heat-island intensity.
  • OOB score (≈ 0.87) offers a built-in cross-validation check for over-fitting.
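One way to corroborate the OOB estimate is k-fold cross-validation. Since our merged DataFrame isn't reproduced here, this is a self-contained sketch on synthetic stand-in data shaped roughly like ours (~169 rows of numeric weather features); the variable names are illustrative, not from the notebook:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(169, 8))                            # stand-in feature matrix
y_demo = 2 * X_demo[:, 0] + rng.normal(scale=0.1, size=169)   # stand-in target

rf_demo = RandomForestRegressor(n_estimators=300, random_state=42, n_jobs=-1)
scores = cross_val_score(rf_demo, X_demo, y_demo, cv=5, scoring="r2")
print(f"5-fold R²: {scores.mean():.3f} ± {scores.std():.3f}")
```

On the real data, substituting X and y from the primary analysis would give a CV estimate directly comparable to the OOB R² of ≈ 0.87.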

Visualization¶

We can visualize the accuracy of our model with a scatterplot of predicted against actual measured values. We also draw an "ideal prediction" line, which shows where points would fall if every prediction exactly matched its measured value. As the plot shows, our test samples cluster tightly around this line.

In [140]:
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred, color='blue', label='Test samples')
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], color='red', label='Ideal predictions')
plt.title('Actual vs. Predicted UHI Difference between Bronx and Manhattan')
plt.xlabel('Actual UHI Difference (Degrees Celsius)')
plt.ylabel('Predicted UHI Difference (Degrees Celsius)')
plt.legend()
plt.show()
[Figure: scatterplot of actual vs. predicted UHI difference with the ideal-prediction line]

Example Usage of Model¶

To use our model with external data, we first save it to a file with joblib. This lets us reload it later without retraining and share it with others. We then create example external data to demonstrate how to use the model.

In [141]:
import joblib
joblib.dump(rf, 'rf_uhi.pkl')
Out[141]:
['rf_uhi.pkl']

First we load the model from the file.

Then we can create dictionaries that list the different features and their values for:

  • Relative Humidity (percent)
  • Avg Wind Speed (m/s)
  • Wind Direction (degrees)
  • Solar Flux (W/m^2)

Each column name uses snake_case and is prefixed with Bronx_ or Man_ to indicate where the measurement was taken. We then build a DataFrame from the dictionaries and select only the columns the model expects (the feature_cols variable defined earlier in the primary analysis).

In [160]:
rf_load = joblib.load('rf_uhi.pkl')

ex1 = {
    'Bronx_Relative_Humidity': 90,
    'Bronx_Avg_Wind_Speed': 2.0,
    'Bronx_Wind_Direction': 35,
    'Bronx_Solar_Flux': 800,
    'Man_Relative_Humidity': 65,
    'Man_Avg_Wind_Speed': 1,
    'Man_Wind_Direction': 4,
    'Man_Solar_Flux': 10,
}

ex2 = {
    'Bronx_Relative_Humidity': 0,
    'Bronx_Avg_Wind_Speed': 2.0,
    'Bronx_Wind_Direction': 70,
    'Bronx_Solar_Flux': 900,
    'Man_Relative_Humidity': 65,
    'Man_Avg_Wind_Speed': 1,
    'Man_Wind_Direction': 4,
    'Man_Solar_Flux': 10,
}

df_ex = pd.DataFrame([ex1, ex2])
preds = rf_load.predict(df_ex[feature_cols])
print("Given this example data, this is what the model predicts")
print("Ex 1 predicted UHI diff:", preds[0].round(3), "degrees Celsius")
print("Ex 2 predicted UHI diff:", preds[1].round(3), "degrees Celsius")
Given this example data, this is what the model predicts
Ex 1 predicted UHI diff: 2.223 degrees Celsius
Ex 2 predicted UHI diff: -1.434 degrees Celsius

Conclusion¶

This analysis provides several different insights into the urban heat island effect and the model we created.

Our random forest model predicts the temperature difference well, capturing ~96 % of the variance with a mean absolute error of ~0.19 degrees Celsius. Additionally, it ranks Bronx relative humidity (~82 % mean decrease in impurity), Manhattan relative humidity (~7 %), and Bronx solar flux (~3 %) as its most important features, meaning these factors are most influential in predicting the temperature difference between the Bronx and Manhattan. This makes sense, since warmer air can hold more water vapor, leading to a tendency for higher humidity.

The high feature importance of humidity doesn't mean that humidity causes the urban heat island effect, but it does show that humidity is a strong indicator for forecasting UHI intensity and temperature differences. One actionable takeaway from this tutorial would be a project that pairs humidity sensors with our model to notify citizens and data scientists on days when the urban heat island effect is most prevalent.

Another idea arising from this tutorial is that the built structure of cities (pavement, roads, and buildings) affects humidity. With less vegetation and fewer water sources, there is less water available to evaporate into the air, lowering humidity relative to rural areas. With further research, we could examine building footprints and the extent of paved surfaces in Manhattan and connect those findings with air temperatures.

Overall, our project delved deeply into the urban heat island effect, explaining the phenomenon and our data science methodology to new readers while showing informed readers surprising results, like Bronx relative humidity being the driving factor behind predicting temperature differences. Revisiting the questions posed in the introduction, our model came very close to the actual measurements, predicting the Manhattan-Bronx temperature difference with a mean absolute error under 0.2 degrees Celsius. Additionally, our tutorial examined the factors driving the urban heat island effect, identifying the top predictors of the temperature difference between the Bronx and Manhattan.